Urban Greenspace and High Blood Pressure Prevalence: Chicago, IL Case Study¶

Introduction¶

Vegetation in urban areas serves many ecological purposes such as reducing urban heat island effect, air quality, water quality, stormwater mitigation, and more. The results between specific human health outcomes and a simple vegetation metric such as average NDVI are often inconclusive. There is strong research to support that less access to greenspace in general has a negative affect on mortality, heart rate, and violence, and access to greenspace has a positive influence on attention, mood, and physical activity. There is less abundance of research studies and mixed results looking at greenspace access and chronic health conditions such as asthma, diabetes, and high blood pressure (Kondo et al., 2018). Additionally, less research looks at correlations at finer scale landscape metrics. Landscape metrics such as edge density, mean patch size, and fragmentation can be used to attempt to quantify these relationships between greenspaces and human health. These metrics can be used to explore the relationships between urban greenspaces and human health. This data analysis will explore these metrics at the census tract level in relation to high blood pressure rates in Chicago, IL. High blood pressure is a primary indicator for cardiovascular disease and is the leading cause of death worldwide. While studies have shown a beneficial effect from greenspaces on high blood pressure, the results remain mixed or inconclusive (Zhao et al., 2022). Therefore, it remains important to explore the relationship between high blood pressure and greenspaces, especially quantitative studies.

Study Area¶

Chicago, IL is the third largest city in the United States and is home to more than 2,721,308 people. The Chicago metro area has a larger footprint at around 9.5 million people. It is a very culturally diverse city with a large population of immigrants and a racial makeup of roughly 33-36% White, 29% Black or African American, 29-30% Hispanic or Latino, and 6-7% Asian (U.S. Census Bureau, 2024). While the racial composition of the city appears balanced at the city level, Chicago still experiences heavy residential segregation, containing specific racial-majority areas across census tracts. Chicago is a highly urbanized city that has dense residential communities, transportation infrastructure networks, industrial areas, and urban greenspaces, resulting in significant spatial variation. Chicago is a good case study to explore the relationships between greenspaces and high blood pressure, due to this wide variation across census tracts.

Methods¶

This data analysis explores the relationship between high blood pressure and urban greenspace through metrics such as edge density, mean patch size, and fragmentation derived from Normalized Difference Vegetation Index (NDVI) calculations from the National Agricultural Imagery Program (NAIP) spatial data. NAIP spatial data is collected at a fine spatial resolution of 0.6-meter ground sample distance (GSD) for the purpose of being able to calculate these landscape metrics. The temporal resolution is coarser for NAIP imagery as it is only conducted every few years compared to the temporal resolution of satellite imagery. For this analysis, fine spatial resolution with coarse temporal resolution is suitable as spatial detail is what is needed, to compare with human health data. These NDVI statistics are combined with the 2024 release of the Center for Disease Control (CDC) PLACES dataset. This dataset contains human health data for a variety of chronic health conditions and diseases including high blood pressure, at the census tract scale. These datasets are combined to explore the relationship and potential correlation between urban greenspace and high blood pressure. Due to the quantity of data in this analysis, to save space and memory, the calculations are performed in the cloud rather than downloading the raster data to the computer.. The model uses an Ordinary Least Squares Linear Regression Model, described further below.

STEP 1: Set up analysis¶

In [ ]:
### Import libraries

# File paths and organization
import os
import pathlib

# See warnings
import warnings

# Work with deifferent data types (tabular, vector, raster)
import pandas as pd
import geopandas as gpd 
import numpy as np
import rioxarray as rxr
import xarray as xr
import rioxarray.merge as rxrmerge
import shapely
import time

# Plotting and visuallization packages
import geoviews as gv
import holoviews as hv
import hvplot.pandas 
import hvplot.xarray
from cartopy import crs as ccrs

# Data exploration and visualiztion
import matplotlib 
import scipy.stats as stats
import matplotlib.pyplot as plt

# Image processing 
from scipy.ndimage import convolve 
from scipy.ndimage import label

# Work with data in the cloud
import pystac_client 
from sodapy import Socrata

# Use progress bar
from tqdm.notebook import tqdm

# OLS sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
In [ ]:
# Create reproducible file paths
data_dir = os.path.join(
    # Home directory
    pathlib.Path.home(),

    # Earth Aaalytics
    'Documents',
    'education',
    'earth-data-analytics',
    'spring-2026-data',

    # Project directory
    'hbp-urban-green-big-data'
)

# Make the dir once
os.makedirs(data_dir, exist_ok=True)
In [ ]:
# Revent GDAL from quitting due to momentary disruptions
os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"

STEP 2: Create a site map¶

In [4]:
### Set up the census tract path

# Define the directory
tract_dir = os.path.join(data_dir, 'chicago-tract')

# Make the directory
os.makedirs(tract_dir, exist_ok=True)

# Make path to directory
tract_path = os.path.join(tract_dir,'chicago-tract.shp')

# Download the census tracts (only once)
if not os.path.exists(tract_path):

    # Add in census url
    tract_url = ('https://data.cdc.gov/download/x7zy-2xmx/application%2Fzip')

    # Read the shp file
    tract_gdf = gpd.read_file(tract_url)

    # Subset to Chicago
    chi_tract_gdf = tract_gdf[tract_gdf.PlaceName == 'Chicago']

    # Save it as a file
    chi_tract_gdf.to_file(tract_path, index = False)

# Load in census tract data
chi_tract_gdf = gpd.read_file(tract_path)

# Show 
chi_tract_gdf

# Site plot -- Census tracts with satellite imagery in the background
( 
    chi_tract_gdf

    # set projection
    .to_crs(ccrs.Mercator())

# Plot with satellite image
.hvplot(
    line_color = 'orange', 
    fill_color = None,
    crs = ccrs.Mercator(), 
    tiles = 'EsriImagery',
    frame_width = 600,
    title = "Chicago, IL Census Tract Map (CDC PLACES Dataset - 2024 Release)"
)
)
Out[4]:

Initial Map Reflection¶

Visually, the greenspace does not appear to be evenly distributed. It appears that many tracts contain a large amount of greenspace or they do not contain much greenspace at all. Many of the tracts appear nearly uniform in the shade of green or grey within their boundaries. Chicago historically has, not unlike many other cities, experienced a history of red-lining, inequitable planning decisions, and differences in funding distribution across the city (Mitchell, 2025). It has been found that the greenspace available in different census tracts correlates to racial majorities and income distribution. Tracts with higher Black and Hispanic populations having less access and availability of urban greenspace. It has also been shown that White-majority census tracts have better access regardless of income whereas income inequality and access to urban greenspace is higher for Black and Hispanic populations (Lui et al., 2021).

Context¶

Redlining, a common practice in the 1930s and 1940s, was when the Home Owners’ Loan Corporation (HOLC) and the Federal Housing Administration (FHA) classified different neighborhoods often based on racial distributions as higher risk or lower risk. Areas with higher populations of black and brown Americans were classified as high risk while white neighborhoods were classified as low risk. This didn't just affect the communities at the time. Researchers have shown that the effects continue to affect funding, home values, property values, healthcare access, and more for the decades to follow (Schwegman, 2025). This could be evidence as to why we see less greenspace in historically Black or Hispanic neighborhoods, as funding for greenspace conservation and parks comes through investment. Underinvestment in the communities that have higher minority populations throughout the last 100 years would lead to less greenspace access, vegetation conservation and parks in those census tracts.

Many areas and census tracts in Chicago are Black or Hispanic-majority, meaning that there are many people without access to urban greenspace which is proven to be beneficial to human health outcomes and overall well-being. According to Lui et al. (2021), since Black and Hispanic-majority communities experience the greatest disparities in income-driven access to greenspaces. Special attention to be paid to Black and Hispanic-majority neighborhoods that are also low income. They are the most disadvantaged group and have historically received less funding due to long-term systemic racial discrimination.

While this data analysis focuses directly on correlations between high blood pressure and urban greenspace metrics, such as edge density, this important contextual background on census tract demographics should be considered in understanding this relationship and clustering.

Data Description & Citation

The following analysis using geospatial friendly census tract data retrieved above from the 2024 release of the Center for Diseas Control and Prevention (CDC) PLACES dataset.

Data Citation: Centers for Disease Control and Prevention. (2024). PLACES: Census tract data [Data set]. U.S. Department of Health & Human Services. https://data.cdc.gov/download/x7zy-2xmx/application/zip

STEP 3 - Access High Blood Pressure and Urban Greenspaces Data¶

Step 3a - Access human health data¶

In [5]:
# Set up a path for the asthma data
cdc_path = os.path.join(data_dir, '.csv')

# Download asthma data (only once) - conditional statement, if it exists do this, if not...
if not os.path.exists(cdc_path):
    
    cdc_url = (
        
        ### url for data
        "https://data.cdc.gov/resource/cwsq-ngmh.csv"

        ### parameters to filter on
        "?year=2023"
        "&stateabbr=IL"
        "&countyname=Cook"
        "&measureid=BPHIGH"
        "&$limit=1500"

    )
    
    cdc_df = (

        # Open as csv
        pd.read_csv(cdc_url)

        # Rename the cols
        .rename(columns = {
            'data_value': 'high_blood_pressure',
            'low_confidence_limit': 'hbp_ci_low',
            'high_confidence_limit': 'hbp_ci_high',
            'locationname' : 'tract'})

        # Select colums to keep
        [[
            'year', 'tract',
            'high_blood_pressure', 'hbp_ci_low', 'hbp_ci_high',
            'data_value_unit', 'totalpopulation',
            'totalpop18plus'
        ]]
    )

    # Download it to a csv
    cdc_df.to_csv(cdc_path, index = False)

# Load in asthma data
cdc_df = pd.read_csv(cdc_path) 

# Preview asthma data
cdc_df
Out[5]:
year tract high_blood_pressure hbp_ci_low hbp_ci_high data_value_unit totalpopulation totalpop18plus
0 2023 17031140800 23.0 20.9 25.1 % 6486 5247
1 2023 17031030500 21.7 19.6 23.7 % 6183 5381
2 2023 17031031502 31.9 29.4 34.3 % 4712 4125
3 2023 17031062700 16.9 15.2 18.5 % 2955 2425
4 2023 17031081800 14.8 13.2 16.5 % 11373 10668
... ... ... ... ... ... ... ... ...
1323 2023 17031833900 39.9 37.2 42.6 % 2333 1598
1324 2023 17031840700 21.3 19.3 23.4 % 3900 2885
1325 2023 17031840800 29.0 26.5 31.5 % 3332 2380
1326 2023 17031832600 16.5 14.7 18.2 % 4147 3442
1327 2023 17031843000 44.6 41.7 47.3 % 2880 2004

1328 rows × 8 columns

Step 3b - Join health data with census tract boundaries¶

In [6]:
# Change tract identifier datatype for merging
chi_tract_gdf.tract2010 = chi_tract_gdf.tract2010.astype('int64')

# Merge census data with geometry
tract_cdc_gdf = (
    chi_tract_gdf
    .merge(cdc_df,
           
           
           # Specify columns to merge on
           left_on = 'tract2010',
           right_on = 'tract',
           
           # Use inner join
           how = 'inner'
           )
)
# Show the gdf
tract_cdc_gdf

# Plot asthma data as chloropleth
(
    gv.tile_sources.EsriImagery

    *

    # Map census tracts with asthma data
    gv.Polygons(

        # reprohect to align CRS
        tract_cdc_gdf.to_crs(ccrs.Mercator()),

        # Select columns
        vdims = [('high_blood_pressure', 'HBP Rate (%)'), ('tract2010', 'Census Tract')], # Added column names for hover

        # Ensure CRSs align
        crs = ccrs.Mercator()
        
    # Add in plot specifications    
    ).opts(color = 'high_blood_pressure', colorbar = True, tools = ['hover'])
    ).opts(width = 600, 
       height = 600, 
       xaxis = None, 
       yaxis = None, 
       title = "Chicago Census Tracts by 2023 High Blood Pressure Rates (%)", # Added title
)
Out[6]:

The spatial distribution appears to have clustering of high blood pressure (HBP) rates, with tracts with higher HBP rates adjacent to one another and lower HBP rates also being clustered. We will explore whether or not we see an association with HBP and landscape metrics such as edge density and mean patch size. Based on visual association alone and comparing the Chicago Racial Dot Map 2020 (2024), the highest rates are very strongly visually associated with Black-majority communities and the lowest rates among White-majority census tracts. Redlining practices, systemic racism, and subsequent underinvestment likely contributes to these higher HBP clusters. This could indicate to urban planners and government officials that access to urban greenspace is a public health concern as well as an environmental justice issue.

Data Description & Citation

A spatial join was completed with Center for Disease Control and Prevention (CDC) data set providing high blood pressure rates for Chicago, IL. The data was filtered for the year 2023.

Data Citation Centers for Disease Control and Prevention. (2024). PLACES: Local data for better health—Asthma [Data set]. U.S. Department of Health & Human Services. https://data.cdc.gov/resource/cwsq-ngmh.csv

In [7]:
# Number of unique tracts
tract_cdc_gdf['tract2010'].nunique()
Out[7]:
788

Step 3c - Get data URLs for urban greenspace imagery¶

NAIP data is freely available through the Microsoft Planetary Computer SpatioTemporal Access Catalog (STAC).

In [8]:
# Connect to the planetary computer catalog
e84_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)
In [9]:
# Convert geometry to lat/lon for STAC
tract_latlong_gdf = tract_cdc_gdf.to_crs(4326)

# Define a path to save NDVI stats
ndvi_stats_path = os.path.join(data_dir, 'chicago-ndvi-stats.csv')

### Check for existing data - do not access duplicate tracts

# Initialize an empty list for the census tract IDs
downloaded_tracts = []

# Write the conditional
if os.path.exists(ndvi_stats_path):

    # If it exists, open the csvs in the path
    ndvi_stats_df = pd.read_csv(ndvi_stats_path)

    # Fill the list with the tract values
    downloaded_tracts = ndvi_stats_df.tract.values

# If it doesn't already exist
else:
    print('No census tracts downloads so far.')
No census tracts downloads so far.
In [10]:
# Run 1 interation 
i = 0
tract_values = tract_latlong_gdf.iloc[i]
tract = tract_values.tract2010
tract
Out[10]:
np.int64(17031010100)
In [11]:
# Check if stats already download for this tract
if not (tract in downloaded_tracts):
    print('proceed with the STAC search')
proceed with the STAC search
In [12]:
# Define naip_search
naip_search = e84_catalog.search(

    # Filter naip data
    collections = ['naip'],

    # Grab 2021 naip imagery
    intersects = shapely.to_geojson(tract_values.geometry),
    datetime = "2021"
)
In [13]:
# Naip_search
list(naip_search.items())[:2]

# Build a dataframe
scene_df = pd.DataFrame(dict(

    # Tract column
    tract = tract,

    # Make date column
    data = [pd.to_datetime(scene.datetime).date()

            ### return the scenes (STAC items)
            for scene in naip_search.items()],

    # Make rgbir_href
    rgbir_href = [scene.assets['image'].href for scene in naip_search.items()],
))

# Disply the first scene information
scene_df
Out[13]:
tract data rgbir_href
0 17031010100 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
In [15]:
### Loop through all the tracts

# Convert geometry to lat/lon for STAC
tract_latlong_gdf = tract_cdc_gdf.to_crs(4326)

# Define a path to save NDVI stats
ndvi_stats_path = os.path.join(data_dir, 'chicago-ndvi-stats.csv')

### Check for existing data - do not access duplicate tracts

# Initialize an empty list for the census tract IDs
downloaded_tracts = []

# Write the conditional
if os.path.exists(ndvi_stats_path):

    ### If it exists, open the csvs in the path
    ndvi_stats_df = pd.read_csv(ndvi_stats_path)

    ### Fill the list with the tract values
    downloaded_tracts = ndvi_stats_df.tract.values

# If it doesn't already exist
else:
    print('No census tracts downloads so far.')

### Loop through each census tract

# Make a list to accumulate into
scene_dfs = []

# Loop through each tract value in the gdf 
for i, tract_values in tqdm(tract_latlong_gdf.iterrows()):

    ### For i, extract the tract id
    tract = tract_values.tract2010

# Check if statistics are already downloaded for this tract
    if not (tract in downloaded_tracts):

        # Repeat up to 5 times in case of a momentary disruption 
         i = 0 # Intialize retry counter
         retry_limit = 5 # Make number of tries 
         while i < retry_limit:

            # Try accessing the STAC
            try:

                # Search for tiles 
                naip_search = e84_catalog.search(
                    collections = ['naip'],
                    intersects = shapely.to_geojson(tract_values.geometry),
                    datetime="2021"
                )

                # Build dataframe with tracts and tile urls
                scene_dfs.append(pd.DataFrame(dict(

                    # Columns called tract
                    tract = tract,

                    date = [pd.to_datetime(scene.datetime).date()
                            
                            for scene in naip_search.items()],

                    rgbir_href = [scene.assets['image'].href for scene in naip_search.items()],
                
                )))

                # Exit retry loop once STAC request succeeds
                break

            # Try again in case of an APIError
            except pystac_client.exceptions.APIError:
                print(
                    f'Could not connect with stack server.'
                    f'Retrying tract [tract]...')
                
                # Wait 2 secs before retrying to not upset the API
                time.sleep(2)
                i += 1
                continue
                

# Concatenate the url dataframes
if scene_dfs:

        # Concatenate
        scene_df = pd.concat(scene_dfs).reset_index(drop = True)

# Otherwise, tell me it didn't work
else:
    scene_df = None
    
# Preview the url dataframe
scene_df
No census tracts downloads so far.
0it [00:00, ?it/s]
Out[15]:
tract date rgbir_href
0 17031010100 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
1 17031010201 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
2 17031010201 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
3 17031010202 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
4 17031010300 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
... ... ... ...
1252 17031807900 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
1253 17031807900 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
1254 17031807900 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
1255 17031808100 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
1256 17031808100 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...

1257 rows × 3 columns

In [16]:
# Write out scene_df to csv
scene_df.to_csv(ndvi_stats_path, index=False)

Step 3d- Compute NDVI Statistics¶

Fragmentation statistics:

Percentage vegetation: The percentage of pixels that exceed a vegetation threshold (.12 is common with Landsat) Mean patch size

Mean patch size: The average size of patches, or contiguous area exceeding the vegetation threshold. Patches can be identified with the label function from scipy.ndimage

Edge density: The proportion of edge pixels among vegetated pixels. Edges can be identified by convolving the image with a kernel designed to detect pixels that are different from their surroundings.

In [17]:
# Open df of NAIP imagery URLs
scene_df = pd.read_csv(ndvi_stats_path)

# Display all scene urls
scene_df
Out[17]:
tract date rgbir_href
0 17031010100 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
1 17031010201 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
2 17031010201 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
3 17031010202 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
4 17031010300 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
... ... ... ...
1252 17031807900 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
1253 17031807900 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
1254 17031807900 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
1255 17031808100 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...
1256 17031808100 2021-09-08 https://naipeuwest.blob.core.windows.net/naip/...

1257 rows × 3 columns

In [18]:
# Test how many of scene_df are unique
print("Number of tracts:", len(scene_df.tract.unique()))
Number of tracts: 788
In [19]:
# Pull out the first tract value
tract = chi_tract_gdf.loc[0].tract2010
tract

# Covert tract to string
tract = str(tract)
tract
Out[19]:
'17031010100'
In [20]:
# Convert tracts to string
href_value = scene_df.loc[
    scene_df.tract.astype(str) == tract,
                          
    ### Select column to extract
    "rgbir_href"
].iloc[0]

# Disply url
href_value
Out[20]:
'https://naipeuwest.blob.core.windows.net/naip/v002/il/2021/il_060cm_2021/42087/m_4208759_se_16_060_20210908.tif'
In [21]:
### Process one scene

# Open the Raster
tile_da = rxr.open_rasterio(

    # Give it the url selected
    href_value,

    # Mask out the NoData pixels
    masked = True

# Remove the dimensions of length = 1
).squeeze()

# Check it out
tile_da
Out[21]:
<xarray.DataArray (band: 4, y: 12724, x: 9643)> Size: 2GB
[490790128 values with dtype=float32]
Coordinates:
  * band         (band) int64 32B 1 2 3 4
  * y            (y) float64 102kB 4.657e+06 4.657e+06 ... 4.65e+06 4.65e+06
  * x            (x) float64 77kB 4.428e+05 4.428e+05 ... 4.486e+05 4.486e+05
    spatial_ref  int64 8B 0
Attributes:
    TIFFTAG_DOCUMENTNAME:      Evanston SE 4208759
    TIFFTAG_IMAGEDESCRIPTION:  Image courtesy of USDA Farm Service Agency's N...
    TIFFTAG_DATETIME:          2021:10:21 20:26:02
    TIFFTAG_ARTIST:            Surdex Corporation, 636-368-4400, www.surdex.com
    TIFFTAG_HOSTCOMPUTER:      AMDNODE7
    AREA_OR_POINT:             Area
    scale_factor:              1.0
    add_offset:                0.0
xarray.DataArray
  • band: 4
  • y: 12724
  • x: 9643
  • ...
    [490790128 values with dtype=float32]
    • band
      (band)
      int64
      1 2 3 4
      array([1, 2, 3, 4])
    • y
      (y)
      float64
      4.657e+06 4.657e+06 ... 4.65e+06
      array([4657298.7, 4657298.1, 4657297.5, ..., 4649666.1, 4649665.5, 4649664.9],
            shape=(12724,))
    • x
      (x)
      float64
      4.428e+05 4.428e+05 ... 4.486e+05
      array([442805.1, 442805.7, 442806.3, ..., 448589.1, 448589.7, 448590.3],
            shape=(9643,))
    • spatial_ref
      ()
      int64
      0
      crs_wkt :
      PROJCS["NAD83 / UTM zone 16N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-87],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26916"]]
      semi_major_axis :
      6378137.0
      semi_minor_axis :
      6356752.314140356
      inverse_flattening :
      298.257222101
      reference_ellipsoid_name :
      GRS 1980
      longitude_of_prime_meridian :
      0.0
      prime_meridian_name :
      Greenwich
      geographic_crs_name :
      NAD83
      horizontal_datum_name :
      North American Datum 1983
      projected_crs_name :
      NAD83 / UTM zone 16N
      grid_mapping_name :
      transverse_mercator
      latitude_of_projection_origin :
      0.0
      longitude_of_central_meridian :
      -87.0
      false_easting :
      500000.0
      false_northing :
      0.0
      scale_factor_at_central_meridian :
      0.9996
      spatial_ref :
      PROJCS["NAD83 / UTM zone 16N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-87],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26916"]]
      GeoTransform :
      442804.8 0.6 0.0 4657299.0 0.0 -0.6
      array(0)
  • TIFFTAG_DOCUMENTNAME :
    Evanston SE 4208759
    TIFFTAG_IMAGEDESCRIPTION :
    Image courtesy of USDA Farm Service Agency's National Agriculture Imagery Program (NAIP) under Aerial Photography Field Office contract 47QTCA18D00J5. Imagery has been placed in the public domain and may be used and reproduced without permission or fee. Please credit 'NAIP imagery provided by USDA Farm Service Agency' on any use.
    TIFFTAG_DATETIME :
    2021:10:21 20:26:02
    TIFFTAG_ARTIST :
    Surdex Corporation, 636-368-4400, www.surdex.com
    TIFFTAG_HOSTCOMPUTER :
    AMDNODE7
    AREA_OR_POINT :
    Area
    scale_factor :
    1.0
    add_offset :
    0.0
In [22]:
# Check the dimensions
tile_da.dims
Out[22]:
('band', 'y', 'x')
In [23]:
# Print the column names
print(scene_df.columns)
Index(['tract', 'date', 'rgbir_href'], dtype='object')
In [ ]:
# Open high blood pressure data spatial aligned with census tracts
tract_cdc_gdf
Out[ ]:
place2010 tract2010 ST PlaceName plctract10 PlcTrPop10 geometry year tract high_blood_pressure hbp_ci_low hbp_ci_high data_value_unit totalpopulation totalpop18plus
0 1714000 17031010100 17 Chicago 1714000-17031010100 4854 POLYGON ((-9758835.381 5164429.383, -9758837.3... 2023 17031010100 32.6 30.1 35.2 % 4905 4083
1 1714000 17031010201 17 Chicago 1714000-17031010201 6450 POLYGON ((-9760143.496 5163888.741, -9760143.4... 2023 17031010201 33.1 30.5 35.7 % 6939 5333
2 1714000 17031010202 17 Chicago 1714000-17031010202 2818 POLYGON ((-9759754.212 5163883.196, -9759726.6... 2023 17031010202 32.6 29.9 35.2 % 2742 2296
3 1714000 17031010300 17 Chicago 1714000-17031010300 6236 POLYGON ((-9758695.229 5163870.91, -9758695.78... 2023 17031010300 32.2 29.6 34.7 % 6305 5606
4 1714000 17031010400 17 Chicago 1714000-17031010400 5042 POLYGON ((-9757724.634 5160715.939, -9757742.2... 2023 17031010400 22.5 20.3 24.5 % 5079 4742
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
783 1714000 17031770700 17 Chicago 1714000-17031770700 0 POLYGON ((-9780753.304 5157066.079, -9780752.0... 2023 17031770700 31.3 28.7 34.0 % 2537 2093
784 1714000 17031770800 17 Chicago 1714000-17031770800 0 POLYGON ((-9783235.84 5154620.343, -9783211.23... 2023 17031770800 28.4 25.8 30.9 % 5661 4457
785 1714000 17031805600 17 Chicago 1714000-17031805600 0 POLYGON ((-9776210.02 5161605.738, -9776213.47... 2023 17031805600 25.9 23.3 28.3 % 4710 3420
786 1714000 17031807900 17 Chicago 1714000-17031807900 0 POLYGON ((-9768609.902 5160576.634, -9768654.5... 2023 17031807900 31.6 28.9 34.2 % 4201 3396
787 1714000 17031808100 17 Chicago 1714000-17031808100 0 MULTIPOLYGON (((-9774480.671 5161127.722, -977... 2023 17031808100 42.2 38.8 45.5 % 4010 3648

788 rows × 15 columns

In [25]:
# Check data type
tract_cdc_gdf["tract2010"].head()
Out[25]:
0    17031010100
1    17031010201
2    17031010202
3    17031010300
4    17031010400
Name: tract2010, dtype: int64
In [26]:
# Another way to check data type
tract_cdc_gdf["tract2010"].dtype
Out[26]:
dtype('int64')
In [27]:
# Check that all are strings
tract in tract_cdc_gdf["tract2010"].astype(str).values
Out[27]:
True
In [28]:
# Get boundary of the tract we're working with
boundary = (
    tract_cdc_gdf

    # Deal with issues with integers & ensure all tracts are strings
    .assign(tract2010 = lambda df: df["tract2010"].astype(str)) ### Assign modifies columns

    # Use tract ID as the index
    .set_index("tract2010")

    # Grab the tract wee're using
    .loc[[tract]]

    # Make the CRS match
    .to_crs(tile_da.rio.crs)

    # Grab the geometry
    .geometry
)

# Display boundary information
boundary
Out[28]:
tract2010
17031010100    POLYGON ((444936.583 4652546.967, 444935.043 4...
Name: geometry, dtype: geometry
In [29]:
# Crop to bounding box
crop_da = tile_da.rio.clip_box(

    # Get coordinates of the bounding box
    *boundary.envelope.total_bounds,

    # Let it expand a little
    auto_expand = True
)

# Display the bounding box info
crop_da
Out[29]:
<xarray.DataArray (band: 4, y: 724, x: 1862)> Size: 22MB
[5392352 values with dtype=float32]
Coordinates:
  * band         (band) int64 32B 1 2 3 4
  * y            (y) float64 6kB 4.653e+06 4.653e+06 ... 4.652e+06 4.652e+06
  * x            (x) float64 15kB 4.439e+05 4.439e+05 ... 4.451e+05 4.451e+05
    spatial_ref  int64 8B 0
Attributes:
    TIFFTAG_DOCUMENTNAME:      Evanston SE 4208759
    TIFFTAG_IMAGEDESCRIPTION:  Image courtesy of USDA Farm Service Agency's N...
    TIFFTAG_DATETIME:          2021:10:21 20:26:02
    TIFFTAG_ARTIST:            Surdex Corporation, 636-368-4400, www.surdex.com
    TIFFTAG_HOSTCOMPUTER:      AMDNODE7
    AREA_OR_POINT:             Area
    scale_factor:              1.0
    add_offset:                0.0
xarray.DataArray
  • band: 4
  • y: 724
  • x: 1862
  • ...
    [5392352 values with dtype=float32]
    • band
      (band)
      int64
      1 2 3 4
      array([1, 2, 3, 4])
    • y
      (y)
      float64
      4.653e+06 4.653e+06 ... 4.652e+06
      axis :
      Y
      long_name :
      y coordinate of projection
      standard_name :
      projection_y_coordinate
      units :
      metre
      array([4652560.5, 4652559.9, 4652559.3, ..., 4652127.9, 4652127.3, 4652126.7],
            shape=(724,))
    • x
      (x)
      float64
      4.439e+05 4.439e+05 ... 4.451e+05
      axis :
      X
      long_name :
      x coordinate of projection
      standard_name :
      projection_x_coordinate
      units :
      metre
      array([443934.9, 443935.5, 443936.1, ..., 445050.3, 445050.9, 445051.5],
            shape=(1862,))
    • spatial_ref
      ()
      int64
      0
      crs_wkt :
      PROJCS["NAD83 / UTM zone 16N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-87],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26916"]]
      semi_major_axis :
      6378137.0
      semi_minor_axis :
      6356752.314140356
      inverse_flattening :
      298.257222101
      reference_ellipsoid_name :
      GRS 1980
      longitude_of_prime_meridian :
      0.0
      prime_meridian_name :
      Greenwich
      geographic_crs_name :
      NAD83
      horizontal_datum_name :
      North American Datum 1983
      projected_crs_name :
      NAD83 / UTM zone 16N
      grid_mapping_name :
      transverse_mercator
      latitude_of_projection_origin :
      0.0
      longitude_of_central_meridian :
      -87.0
      false_easting :
      500000.0
      false_northing :
      0.0
      scale_factor_at_central_meridian :
      0.9996
      spatial_ref :
      PROJCS["NAD83 / UTM zone 16N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-87],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26916"]]
      GeoTransform :
      443934.6 0.6000000000000187 0.0 4652560.8 0.0 -0.5999999999997424
      array(0)
  • TIFFTAG_DOCUMENTNAME :
    Evanston SE 4208759
    TIFFTAG_IMAGEDESCRIPTION :
    Image courtesy of USDA Farm Service Agency's National Agriculture Imagery Program (NAIP) under Aerial Photography Field Office contract 47QTCA18D00J5. Imagery has been placed in the public domain and may be used and reproduced without permission or fee. Please credit 'NAIP imagery provided by USDA Farm Service Agency' on any use.
    TIFFTAG_DATETIME :
    2021:10:21 20:26:02
    TIFFTAG_ARTIST :
    Surdex Corporation, 636-368-4400, www.surdex.com
    TIFFTAG_HOSTCOMPUTER :
    AMDNODE7
    AREA_OR_POINT :
    Area
    scale_factor :
    1.0
    add_offset :
    0.0
In [ ]:
# Clip to the exact geometry of the census tract
clip_da = crop_da.rio.clip(

    boundary,

    # Include all pixels touching the boundary
    all_touched = True
)

# Display the exact geometry info
clip_da
Out[ ]:
<xarray.DataArray (band: 4, y: 724, x: 1862)> Size: 22MB
array([[[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]],

       [[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]]],
      shape=(4, 724, 1862), dtype=float32)
Coordinates:
  * band         (band) int64 32B 1 2 3 4
  * y            (y) float64 6kB 4.653e+06 4.653e+06 ... 4.652e+06 4.652e+06
  * x            (x) float64 15kB 4.439e+05 4.439e+05 ... 4.451e+05 4.451e+05
    spatial_ref  int64 8B 0
Attributes:
    TIFFTAG_DOCUMENTNAME:      Evanston SE 4208759
    TIFFTAG_IMAGEDESCRIPTION:  Image courtesy of USDA Farm Service Agency's N...
    TIFFTAG_DATETIME:          2021:10:21 20:26:02
    TIFFTAG_ARTIST:            Surdex Corporation, 636-368-4400, www.surdex.com
    TIFFTAG_HOSTCOMPUTER:      AMDNODE7
    AREA_OR_POINT:             Area
    scale_factor:              1.0
    add_offset:                0.0
xarray.DataArray
  • band: 4
  • y: 724
  • x: 1862
  • nan nan nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan
    array([[[nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            ...,
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan]],
    
           [[nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            ...,
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan]],
    
           [[nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            ...,
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan]],
    
           [[nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            ...,
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan],
            [nan, nan, nan, ..., nan, nan, nan]]],
          shape=(4, 724, 1862), dtype=float32)
    • band
      (band)
      int64
      1 2 3 4
      array([1, 2, 3, 4])
    • y
      (y)
      float64
      4.653e+06 4.653e+06 ... 4.652e+06
      axis :
      Y
      long_name :
      y coordinate of projection
      standard_name :
      projection_y_coordinate
      units :
      metre
      array([4652560.5, 4652559.9, 4652559.3, ..., 4652127.9, 4652127.3, 4652126.7],
            shape=(724,))
    • x
      (x)
      float64
      4.439e+05 4.439e+05 ... 4.451e+05
      axis :
      X
      long_name :
      x coordinate of projection
      standard_name :
      projection_x_coordinate
      units :
      metre
      array([443934.9, 443935.5, 443936.1, ..., 445050.3, 445050.9, 445051.5],
            shape=(1862,))
    • spatial_ref
      ()
      int64
      0
      crs_wkt :
      PROJCS["NAD83 / UTM zone 16N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-87],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26916"]]
      semi_major_axis :
      6378137.0
      semi_minor_axis :
      6356752.314140356
      inverse_flattening :
      298.257222101
      reference_ellipsoid_name :
      GRS 1980
      longitude_of_prime_meridian :
      0.0
      prime_meridian_name :
      Greenwich
      geographic_crs_name :
      NAD83
      horizontal_datum_name :
      North American Datum 1983
      projected_crs_name :
      NAD83 / UTM zone 16N
      grid_mapping_name :
      transverse_mercator
      latitude_of_projection_origin :
      0.0
      longitude_of_central_meridian :
      -87.0
      false_easting :
      500000.0
      false_northing :
      0.0
      scale_factor_at_central_meridian :
      0.9996
      spatial_ref :
      PROJCS["NAD83 / UTM zone 16N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-87],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26916"]]
      GeoTransform :
      443934.6 0.6000000000000187 0.0 4652560.8 0.0 -0.5999999999997424
      array(0)
  • TIFFTAG_DOCUMENTNAME :
    Evanston SE 4208759
    TIFFTAG_IMAGEDESCRIPTION :
    Image courtesy of USDA Farm Service Agency's National Agriculture Imagery Program (NAIP) under Aerial Photography Field Office contract 47QTCA18D00J5. Imagery has been placed in the public domain and may be used and reproduced without permission or fee. Please credit 'NAIP imagery provided by USDA Farm Service Agency' on any use.
    TIFFTAG_DATETIME :
    2021:10:21 20:26:02
    TIFFTAG_ARTIST :
    Surdex Corporation, 636-368-4400, www.surdex.com
    TIFFTAG_HOSTCOMPUTER :
    AMDNODE7
    AREA_OR_POINT :
    Area
    scale_factor :
    1.0
    add_offset :
    0.0
In [31]:
# Look at the shape of the clipped file
clip_da.shape
Out[31]:
(4, 724, 1862)
In [32]:
# Calculate NDVI based on math formula (NIR - Red) / (NIR + Red)
ndvi_da = (
    (clip_da.sel(band = 4) - clip_da.sel(band = 1))
    / (clip_da.sel(band = 4) + clip_da.sel(band = 1))
    )

# Look at the ndvi results
ndvi_da
Out[32]:
<xarray.DataArray (y: 724, x: 1862)> Size: 5MB
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]],
      shape=(724, 1862), dtype=float32)
Coordinates:
  * y            (y) float64 6kB 4.653e+06 4.653e+06 ... 4.652e+06 4.652e+06
  * x            (x) float64 15kB 4.439e+05 4.439e+05 ... 4.451e+05 4.451e+05
    spatial_ref  int64 8B 0
xarray.DataArray
  • y: 724
  • x: 1862
  • nan nan nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan
    array([[nan, nan, nan, ..., nan, nan, nan],
           [nan, nan, nan, ..., nan, nan, nan],
           [nan, nan, nan, ..., nan, nan, nan],
           ...,
           [nan, nan, nan, ..., nan, nan, nan],
           [nan, nan, nan, ..., nan, nan, nan],
           [nan, nan, nan, ..., nan, nan, nan]],
          shape=(724, 1862), dtype=float32)
    • y
      (y)
      float64
      4.653e+06 4.653e+06 ... 4.652e+06
      axis :
      Y
      long_name :
      y coordinate of projection
      standard_name :
      projection_y_coordinate
      units :
      metre
      array([4652560.5, 4652559.9, 4652559.3, ..., 4652127.9, 4652127.3, 4652126.7],
            shape=(724,))
    • x
      (x)
      float64
      4.439e+05 4.439e+05 ... 4.451e+05
      axis :
      X
      long_name :
      x coordinate of projection
      standard_name :
      projection_x_coordinate
      units :
      metre
      array([443934.9, 443935.5, 443936.1, ..., 445050.3, 445050.9, 445051.5],
            shape=(1862,))
    • spatial_ref
      ()
      int64
      0
      crs_wkt :
      PROJCS["NAD83 / UTM zone 16N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-87],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26916"]]
      semi_major_axis :
      6378137.0
      semi_minor_axis :
      6356752.314140356
      inverse_flattening :
      298.257222101
      reference_ellipsoid_name :
      GRS 1980
      longitude_of_prime_meridian :
      0.0
      prime_meridian_name :
      Greenwich
      geographic_crs_name :
      NAD83
      horizontal_datum_name :
      North American Datum 1983
      projected_crs_name :
      NAD83 / UTM zone 16N
      grid_mapping_name :
      transverse_mercator
      latitude_of_projection_origin :
      0.0
      longitude_of_central_meridian :
      -87.0
      false_easting :
      500000.0
      false_northing :
      0.0
      scale_factor_at_central_meridian :
      0.9996
      spatial_ref :
      PROJCS["NAD83 / UTM zone 16N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-87],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26916"]]
      GeoTransform :
      443934.6 0.6000000000000187 0.0 4652560.8 0.0 -0.5999999999997424
      array(0)
In [33]:
### Skip this step if data are already downloaded 
if not scene_df is None:

    ### Get an example 
    # Pull out the identifier for an example tract
    tract = chi_tract_gdf.loc[0].tract2010

    # Subset scene_df to only include that tract
    ex_scene_gdf = scene_df[scene_df.tract == tract]

    # Make an empty list to accumulate into
    tile_das = []

    # Loop through all the images for this tract
    for _, href_s in ex_scene_gdf.iterrows():

        # Open vsi connection to data
        tile_da = rxr.open_rasterio(

            # Point to the location of the raster
            href_s.rgbir_href,
        
            # Deal with no data, remove extra dimension
            masked = True).squeeze()

        ### Crop data to the bounding box of the census tract

        # Make the bounding box
        boundary = (
            tract_cdc_gdf

            # Use the tract ID as index
            .set_index('tract2010')

            # Set the geometry for the tract
            .loc[[tract]]

            # Reproject to crs
            .to_crs(tile_da.rio.crs)

            # Extract the geometry
            .geometry
        )
        # Crop to the bounding box
        crop_da = tile_da.rio.clip_box(

            # Get the coordinates to the bounding box
            *boundary.envelope.total_bounds,

            # Expand it a little
            auto_expand = True
        )
        # Clip data to the boundary of the census tract
        clip_da = crop_da.rio.clip(
            boundary, all_touched = True)

        # Compute NDVI
        ndvi_da = (
            (clip_da.sel(band = 4) - clip_da.sel(band = 1))
            / (clip_da.sel(band = 4) + clip_da.sel(band = 1))
        )

        # Accumulate result
        tile_das.append(ndvi_da)

# Check
print(f"Number of tiles in tiles_da: {len(tile_das)}")

# Merge data
scene_da = rxrmerge.merge_arrays(tile_das)
scene_da

# Mask vegetation
veg_mask = (scene_da > .3)


# Calculate mean patch size
labeled_patches, num_patches = label(veg_mask)
patch_sizes = np.bincount(labeled_patches.ravel())[1:]
mean_patch_size = patch_sizes.mean()

# Check the calculations
patch_sizes
Number of tiles in tiles_da: 1
Out[33]:
array([1716,    1,  288, ...,    1,    1,    1], shape=(3426,))
In [34]:
# Check mean patch size result
mean_patch_size
Out[34]:
np.float64(55.225919439579684)
In [35]:
# Make kernel to calculate edge density
kernel = np.array([
    [1, 1, 1], 
    [1, -8, 1], 
    [1, 1, 1]])

# Calculate edge density (0 = no edge, 1 = edge present)
edges = convolve(veg_mask, kernel, mode='constant')
edge_density = np.sum(edges != 0) / veg_mask.size

# Display the edge density
edge_density
Out[35]:
np.float64(0.11861243479654147)

Repeat for all tracts¶

In [36]:
# Make csv for ndvi data
ndvi_stats_path_veg = os.path.join(data_dir, 'chicago-ndvi-stats-veg.csv')
In [37]:
# Skip this step if data are already downloaded 
if not scene_df is None:

    # Loop through the census tracts with URLs
    for tract, tract_date_gdf in tqdm(scene_df.groupby('tract')):

        ### Open all images for tract
        # Create a list to store NDVI arrays, with 1 array per NAIP image
        tile_das = []

        # Iterate over rows in tract-specific dataframe
        for _, href_s in tract_date_gdf.iterrows():

            # Open vsi connection to data
            tile_da = rxr.open_rasterio(

                # Mask and squeeze
                href_s.rgbir_href, masked = True).squeeze()

            ### Clip data
            # Get tract boundary
            boundary = (
                tract_cdc_gdf
                .set_index('tract2010') # Set index
                .loc[[tract]] # Set the geometry
                .to_crs(tile_da.rio.crs) # Ensure crs match
                .geometry   # Extract geometry
            )

            # Crop to bounding box
            crop_da = tile_da.rio.clip_box(
                *boundary.envelope.total_bounds,
                auto_expand = True
            )
            # Clip to actual tract geometry
            clip_da = crop_da.rio.clip(boundary, all_touched = True)

            # Compute NDVI
            ndvi_da = ((clip_da.sel(band = 4) - clip_da.sel(band = 1))
            / (clip_da.sel(band = 4) + clip_da.sel(band = 1)))

            # Accumulate result
            tile_das.append(ndvi_da)

        # Merge data
        scene_da = rxrmerge.merge_arrays(tile_das)

        # Mask vegetation
        veg_mask = (scene_da > .3)

        ### Calculate statistics and save data to file

        # Count all valid pixels in the tract
        total_pixels = scene_da.notnull().sum()

        # Count all vegetated pixels
        veg_pixels = veg_mask.sum()

        # Identify vegetation patches
        labeled_patches, num_patches = label(veg_mask)

        # Count patch pixels
        patch_sizes = np.bincount(labeled_patches.ravel())[1:]

        # Mean patch size
        mean_patch_size = patch_sizes.mean()

        ### Calculate edge density

        # Define the kernel
        kernel = np.array([
            [1, 1, 1],
            [1, -8, 1],
            [1, 1, 1]])
        
        # Detect boundaries between veg and non-veg
        edges = convolve(veg_mask, kernel, mode = 'constant')

        # Calculate proportion of edge pixels by tract area (normalization)
        edge_density = np.sum(edges !=0) / veg_mask.size
        
        ### Add a row to the statistics file for this tract
        # Create new datafram for the tract and append it to the csv
        pd.DataFrame(dict(

            # Store tract ID
            tract = [tract],

            # Store total pixels
            total_pixels = [int(total_pixels)],

            # Fraction of vegetated pixels
            frac_veg = [float(veg_pixels / total_pixels)],

            # Store mean patch size
            mean_patch_size = [mean_patch_size],

            # Edge density
            edge_density = [edge_density]
            
        # Write out as a csv
        )).to_csv(
            ndvi_stats_path_veg,

            # Append mode
            mode = 'a',

            # Get rid of row numbers
            index = False,

            # Write column names if the csv doesn't exist yet
            header = (not os.path.exists(ndvi_stats_path_veg))
        )
### Re-load results from file
  0%|          | 0/788 [00:00<?, ?it/s]
C:\Users\nymve\AppData\Local\Temp\ipykernel_17296\2508500637.py:66: RuntimeWarning: Mean of empty slice.
  mean_patch_size = patch_sizes.mean()
c:\Users\nymve\miniconda3\envs\earth-analytics-python\Lib\site-packages\numpy\_core\_methods.py:144: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
In [38]:
# Check the file
ndvi_stats_df = pd.read_csv(ndvi_stats_path_veg)
ndvi_stats_df
Out[38]:
tract total_pixels frac_veg mean_patch_size edge_density
0 17031010100 1059033 0.178657 55.225919 0.118612
1 17031010201 1531554 0.213859 57.543394 0.161668
2 17031010202 978546 0.186055 63.260250 0.123673
3 17031010300 1308392 0.191675 57.113642 0.126384
4 17031010400 1516964 0.198563 52.983817 0.079474
... ... ... ... ... ...
783 17031843500 5647650 0.075254 9.732779 0.104686
784 17031843600 1142916 0.054393 9.177296 0.101217
785 17031843700 6025768 0.027644 7.477533 0.047642
786 17031843800 3639014 0.093920 24.169295 0.105052
787 17031843900 4521964 0.199113 23.985215 0.124282

788 rows × 5 columns

STEP 4 - Explore data with plots¶

Chloropleth plots¶

In [39]:
# Merge census data with geometry
chi_ndvi_cdc_gdf = (

    # Grab Census tract gdf
    tract_cdc_gdf

    # Merge with NDVI df
    .merge(
        ndvi_stats_df,

        ### match on the tract ID
        ### left: tract2010
        ### right: tract
        left_on = 'tract2010',
        right_on = 'tract',
        how = 'inner'
    )
)

# Check it out
chi_ndvi_cdc_gdf 
Out[39]:
place2010 tract2010 ST PlaceName plctract10 PlcTrPop10 geometry year tract_x high_blood_pressure hbp_ci_low hbp_ci_high data_value_unit totalpopulation totalpop18plus tract_y total_pixels frac_veg mean_patch_size edge_density
0 1714000 17031010100 17 Chicago 1714000-17031010100 4854 POLYGON ((-9758835.381 5164429.383, -9758837.3... 2023 17031010100 32.6 30.1 35.2 % 4905 4083 17031010100 1059033 0.178657 55.225919 0.118612
1 1714000 17031010201 17 Chicago 1714000-17031010201 6450 POLYGON ((-9760143.496 5163888.741, -9760143.4... 2023 17031010201 33.1 30.5 35.7 % 6939 5333 17031010201 1531554 0.213859 57.543394 0.161668
2 1714000 17031010202 17 Chicago 1714000-17031010202 2818 POLYGON ((-9759754.212 5163883.196, -9759726.6... 2023 17031010202 32.6 29.9 35.2 % 2742 2296 17031010202 978546 0.186055 63.260250 0.123673
3 1714000 17031010300 17 Chicago 1714000-17031010300 6236 POLYGON ((-9758695.229 5163870.91, -9758695.78... 2023 17031010300 32.2 29.6 34.7 % 6305 5606 17031010300 1308392 0.191675 57.113642 0.126384
4 1714000 17031010400 17 Chicago 1714000-17031010400 5042 POLYGON ((-9757724.634 5160715.939, -9757742.2... 2023 17031010400 22.5 20.3 24.5 % 5079 4742 17031010400 1516964 0.198563 52.983817 0.079474
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
783 1714000 17031770700 17 Chicago 1714000-17031770700 0 POLYGON ((-9780753.304 5157066.079, -9780752.0... 2023 17031770700 31.3 28.7 34.0 % 2537 2093 17031770700 5547730 0.022761 7.584359 0.020706
784 1714000 17031770800 17 Chicago 1714000-17031770800 0 POLYGON ((-9783235.84 5154620.343, -9783211.23... 2023 17031770800 28.4 25.8 30.9 % 5661 4457 17031770800 174812 0.138652 6.197392 0.172915
785 1714000 17031805600 17 Chicago 1714000-17031805600 0 POLYGON ((-9776210.02 5161605.738, -9776213.47... 2023 17031805600 25.9 23.3 28.3 % 4710 3420 17031805600 463 0.000000 NaN 0.000000
786 1714000 17031807900 17 Chicago 1714000-17031807900 0 POLYGON ((-9768609.902 5160576.634, -9768654.5... 2023 17031807900 31.6 28.9 34.2 % 4201 3396 17031807900 26883 0.079121 23.898876 0.101617
787 1714000 17031808100 17 Chicago 1714000-17031808100 0 MULTIPOLYGON (((-9774480.671 5161127.722, -977... 2023 17031808100 42.2 38.8 45.5 % 4010 3648 17031808100 44540 0.303839 19.785088 0.154949

788 rows × 20 columns

In [40]:
# Additional testing from when playing around to get different level of rows
clean_df = chi_ndvi_cdc_gdf.drop_duplicates(
    subset="tract2010",
    keep="last"
)
In [41]:
# Additional testing
print("Original rows:", len(chi_ndvi_cdc_gdf))
print("After cleaning:", len(clean_df))
print("Unique tracts:", clean_df["tract2010"].nunique())
Original rows: 788
After cleaning: 788
Unique tracts: 788
In [ ]:
### Plot chloropleths with vegetation statistics
# Make a plot_chloropleth function
def plot_chloropleth(gdf, **opts): 

    # Docstring to describe the function
    """Generate a chloropleth with the given color column"""

    # Use geoviews to make a polygon object
    return gv.Polygons (

        gdf.to_crs(ccrs.Mercator()),

        crs = ccrs.Mercator()

        # Drop x and y axis and add a legend
    ).opts(xaxis = None, yaxis = None, colorbar = True, **opts)
In [43]:
# Apply the function to chicago data
(
    plot_chloropleth(

        # Plot high blood pressure by census tract
        chi_ndvi_cdc_gdf, 
        color = 'high_blood_pressure', 
        cmap = 'viridis', 
        title = 'High Blood Pressure Rates in Chicago')

    +

    # Add a second plot with vegetation edge density
    plot_chloropleth(
            chi_ndvi_cdc_gdf, 
            color = 'edge_density', 
            cmap = 'Greens', 
            title = 'Edge Density in Chicago')
)
Out[43]:

Case Study: Chicago Edge Density vs High Blood Pressure Rates

The comparison of census tracts shows some overlap between tracts with higher HBP rates and higher edge density (a measure of vegetation and habitat fragmentation) that was calculated. Edge density represents how many edges exist between different types of landscape features within a boundary area, in this case the tract. Higher edge density is associated with more fragmented vegetation or habitat. Edge density was normalized to account for the difference in pixel size by taking edge density / vegetation pixels, so that census tracts could be compared. If tracts with higher edge density tend to show higher HBP rates, there could be a link between urban greenspace fragmentation and human health outcomes such as HBP. High levels of habitat fragmentation have a lot of negative outcomes not only for human health but also for ecological health including changes in microclimate, wildlife mobility, & further changes to ecological fragmentation.

It is possible that these tracts may be historically minority areas, which have experienced less public funding over time and perhaps increased industrial development. This may have led to less habitat conservation and a more fragmented environment which would lead to higher edge density and a lack of quality greenspace within the census tract. If there is a connection between edge density and public health, then urban, environmental, and land-use planning are even more important in heavily populated areas.

STEP 5: Explore a linear ordinary least-squares regression¶

Model description¶

Ordinary Least Squares (OLS) Model: Chicago High Blood Pressure Rates & Urban Greenspace¶

To conduct an Ordinary Least Squares (OLS) model analysis, there are several factors that are considered and are assumed to use this model. We assume that the relationship between variables such as edge density and patch size related to the corresponding high blood pressure to be a linear relationship. That the errors (residuals = observed y - predicted y) in our data are normally distributed and that our x variables are not highly collinear. Each census tract is assumed to be spatially independent. Log transformations were used to reduce the skew of the data, make it easier to make inferences from the data, and support the assumption of a normal distribution of error. The relationship between our x and y variables is stationary, or the relationship remains the same throughout Chicago. We removed missing data values to meet the requirement for having complete observations.

The objective of this model is to determine if x variables, such as edge density, can be correlated with high blood pressure. We could evaluate this with R2 and p-values. We use q-q plots to support and check that these errors are normally distributed. We can look at the error distribution spatially to look for clustering.

A potential problem with the OLS model is that there may be less spatial independence between tracts or that the tracts may be spatially correlated. This is something to consider for the analysis and future analysis. The relationship between x and y may not be stationary throughout all of Chicago, meaning that the city may not necessarily be uniform and there are additional factors in different census tracts. Some of the variables may have collinearity with each other, which could make false correlations if not considered.

Step 5a - Data preparation¶

In [44]:
# Check missing values per colum
chi_ndvi_cdc_gdf.isna().any()
Out[44]:
place2010              False
tract2010              False
ST                     False
PlaceName              False
plctract10             False
PlcTrPop10             False
geometry               False
year                   False
tract_x                False
high_blood_pressure    False
hbp_ci_low             False
hbp_ci_high            False
data_value_unit        False
totalpopulation        False
totalpop18plus         False
tract_y                False
total_pixels           False
frac_veg               False
mean_patch_size         True
edge_density           False
dtype: bool
In [45]:
# Check missing values per row
chi_ndvi_cdc_gdf.isna().any(axis = 1)
Out[45]:
0      False
1      False
2      False
3      False
4      False
       ...  
783    False
784    False
785     True
786    False
787    False
Length: 788, dtype: bool
In [46]:
# Check for missing data
chi_ndvi_cdc_gdf.isna().any().any()
Out[46]:
np.True_
In [47]:
# Check sum of the data
chi_ndvi_cdc_gdf.isna().sum()
Out[47]:
place2010              0
tract2010              0
ST                     0
PlaceName              0
plctract10             0
PlcTrPop10             0
geometry               0
year                   0
tract_x                0
high_blood_pressure    0
hbp_ci_low             0
hbp_ci_high            0
data_value_unit        0
totalpopulation        0
totalpop18plus         0
tract_y                0
total_pixels           0
frac_veg               0
mean_patch_size        1
edge_density           0
dtype: int64
In [47]:
# Select variables: Make dataframe with the variables we want for the model
In [48]:
# Visualize distribution of data

# Select variables
variables = ['frac_veg', 'high_blood_pressure', 'mean_patch_size', 'edge_density']

# create scatter plot
hvplot.scatter_matrix(

    chi_ndvi_cdc_gdf
    [variables]
)
Out[48]:
In [49]:
### Histograms

### Make facet
fig, axes = plt.subplots(2, 2, figsize = (12, 10))

### List of variables to plot
variables = ['frac_veg', 'high_blood_pressure', 'mean_patch_size', 'edge_density']
titles = ['Vegetated fraction', 'Adult HBP rate', 'Mean patch size', 'Edge density']

### Loop through variables and axes to plot histograms
for i, (var, title) in enumerate(zip(variables, titles)):
    ax = axes[i//2, i%2]
    ax.hist(chi_ndvi_cdc_gdf[var], bins = 10, color = 'gray', edgecolor = 'black')
    ax.set_title(title)
    ax.set_xlabel(var)
    ax.set_ylabel('Frequency')

### Adjust layers to prevent overlap
plt.tight_layout()
plt.show() 
No description has been provided for this image
In [50]:
# Drop missing obs
model_df = (
    chi_ndvi_cdc_gdf

    # Make a copy
    .copy()

    # Subset the columns
    [['frac_veg', 'high_blood_pressure', 'mean_patch_size', 'edge_density', 'geometry']]

    # Drop rows (observations) with missing values
    .dropna()
)

# Display the model
model_df
Out[50]:
frac_veg high_blood_pressure mean_patch_size edge_density geometry
0 0.178657 32.6 55.225919 0.118612 POLYGON ((-9758835.381 5164429.383, -9758837.3...
1 0.213859 33.1 57.543394 0.161668 POLYGON ((-9760143.496 5163888.741, -9760143.4...
2 0.186055 32.6 63.260250 0.123673 POLYGON ((-9759754.212 5163883.196, -9759726.6...
3 0.191675 32.2 57.113642 0.126384 POLYGON ((-9758695.229 5163870.91, -9758695.78...
4 0.198563 22.5 52.983817 0.079474 POLYGON ((-9757724.634 5160715.939, -9757742.2...
... ... ... ... ... ...
782 0.020053 27.9 13.674242 0.029123 MULTIPOLYGON (((-9787333.178 5161561.245, -978...
783 0.022761 31.3 7.584359 0.020706 POLYGON ((-9780753.304 5157066.079, -9780752.0...
784 0.138652 28.4 6.197392 0.172915 POLYGON ((-9783235.84 5154620.343, -9783211.23...
786 0.079121 31.6 23.898876 0.101617 POLYGON ((-9768609.902 5160576.634, -9768654.5...
787 0.303839 42.2 19.785088 0.154949 MULTIPOLYGON (((-9774480.671 5161127.722, -977...

787 rows × 5 columns

In [51]:
### Perform variable transformation
# Log of high_blood_pressure rate
model_df['log_hbp'] = np.log(model_df.high_blood_pressure)

# Log of patch size
model_df['log_patch'] = np.log(model_df.mean_patch_size)

# Show df after transformation
model_df
Out[51]:
frac_veg high_blood_pressure mean_patch_size edge_density geometry log_hbp log_patch
0 0.178657 32.6 55.225919 0.118612 POLYGON ((-9758835.381 5164429.383, -9758837.3... 3.484312 4.011432
1 0.213859 33.1 57.543394 0.161668 POLYGON ((-9760143.496 5163888.741, -9760143.4... 3.499533 4.052539
2 0.186055 32.6 63.260250 0.123673 POLYGON ((-9759754.212 5163883.196, -9759726.6... 3.484312 4.147257
3 0.191675 32.2 57.113642 0.126384 POLYGON ((-9758695.229 5163870.91, -9758695.78... 3.471966 4.045043
4 0.198563 22.5 52.983817 0.079474 POLYGON ((-9757724.634 5160715.939, -9757742.2... 3.113515 3.969987
... ... ... ... ... ... ... ...
782 0.020053 27.9 13.674242 0.029123 MULTIPOLYGON (((-9787333.178 5161561.245, -978... 3.328627 2.615514
783 0.022761 31.3 7.584359 0.020706 POLYGON ((-9780753.304 5157066.079, -9780752.0... 3.443618 2.026088
784 0.138652 28.4 6.197392 0.172915 POLYGON ((-9783235.84 5154620.343, -9783211.23... 3.346389 1.824129
786 0.079121 31.6 23.898876 0.101617 POLYGON ((-9768609.902 5160576.634, -9768654.5... 3.453157 3.173831
787 0.303839 42.2 19.785088 0.154949 MULTIPOLYGON (((-9774480.671 5161127.722, -977... 3.742420 2.984929

787 rows × 7 columns

In [52]:
# Visualize transformed variables
hvplot.scatter_matrix(
    model_df
    [[
        'frac_veg',
        'edge_density',
        'log_patch',
        'log_hbp'
    ]]
)
Out[52]:
In [53]:
### Q-Q plots

# Set variables
var_qq = ['frac_veg', 'edge_density', 'log_patch', 'log_hbp']

plt.figure(figsize = (12, 10))
for i, var in enumerate(var_qq, 1):

    # Make a 2x2 fact
    plt.subplot(2, 2, i)

    # Normalize distribution
    stats.probplot(model_df[var], dist = "norm", plot = plt)

    # Add title
    plt.title(f'Q-Q plot fo {var}')

# Plot it
plt.tight_layout()
plt.show()
No description has been provided for this image

Transformations & Plot Selections¶

First we performed a log transformation of the data to reduce the skew of the data and to be able to make better inferences of our data correlations. We chose to do a Q-Q plot to look at whether our data fit using an OLS model with the assumption that the errors are normally distributed. Mean patch size does seem to fit well and edge density is just very lightly curved or just slightly skewed. High blood pressure shows very small s-curve but is likely normal noise in the data. There are strong between variables edge_density and frac_veg, so there could be possible multicollinearity which is something to consider for using the OLS model. Overall, the Q-Q plots suggest that OLS meets most of the required assumptions are met.

Step 5b - Fit and Predict¶

In [54]:
# Select predictor and outcome variables
X = model_df[['edge_density', 'log_patch']]
y = model_df[['log_hbp']]

# Split into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(

    X, y, test_size = 0.33, random_state = 19
)

# Fit a linear regression
reg = LinearRegression()

# Fit our data
reg.fit(X_train, y_train)

# Predict high_blood_pressure values for the test dataset
y_test['pred_hbp'] = np.exp(reg.predict(X_test))

# Real high_blood_pressure rate for comparison
y_test['high_blood_pressure'] = np.exp(y_test.log_hbp)

y_test
Out[54]:
log_hbp pred_hbp high_blood_pressure
295 2.995732 23.869860 20.0
549 3.666122 36.667828 39.1
595 3.858622 36.374828 47.4
378 2.954910 28.591931 19.2
16 3.538057 33.421144 34.4
... ... ... ...
548 3.475067 31.311823 32.3
260 2.833213 28.360528 17.0
504 3.751854 29.439412 42.6
529 3.269569 31.136613 26.3
59 2.944439 28.355779 19.0

260 rows × 3 columns

In [ ]:
# Plot measured vs. predicted high blood pressure prevalence with a 1-to-1 line
y_max = y_test.high_blood_pressure.max()

(
    y_test
    .hvplot.scatter(x='high_blood_pressure', 
                    y='pred_hbp',
                    xlabel = "measured adult high blood pressure prevalence",
                    ylabel = "Predicted adult high blood pressure prevalence",
                    title = "Linear regression performance = testing data")
    .opts(aspect='equal', 
         xlim=(0, y_max), ylim=(0, y_max), 
         width=600, height=600)
) * hv.Slope(slope=1, y_intercept=0).opts(color='black')
Out[ ]:

Step 5c - Explore spatial bias¶

In [57]:
### Compute model error for all census tracts

# Use the trained model to predict high_blood_pressure prevalence for each census tract
model_df['pred_hbp'] = np.exp(reg.predict(X))

# Calculate model error
model_df['error_hbp'] = model_df['pred_hbp'] - model_df['high_blood_pressure']

# Plot error geographically as a chloropleth
(
    plot_chloropleth(model_df, 
                     color = 'error_hbp', 
                     cmap = 'RdBu', 
                     tools = ['hover']) # Add hover legend with information

    # Set the frame width and aspect ratio
    .opts(
        frame_width = 600, 
        aspect = 'equal', 
        title = 'Prediction Error in High Blood Pressure Rates: Chicago Analysis', # Add title
        clabel= "High Blood PRessure Prediction Error")

)
Out[57]:

The errors, either negative or positive, meaning the predictor was over- or underestimated, are clustered together. This might mean that the census tracts are not spatially independent of each other, a requirement for the OLS model. It could mean that something else in those census tracts is affecting high blood pressure in that area, such as pollution, air quality, socioeconomic status, etc. The negative errors appear to be visually correlated with Black-majority census tracts, and the positive errors with Hispanic- or White-majority regions. For future analysis, a Moran’s I test could be run to examine the degree of spatial autocorrelation.

It may be useful to add demographic data (race/ethnicity) or socioeconomic variables (income/education) to this analysis. Other potential metrics include air quality (PM2.5, ozone), industrial density, or housing density. It is possible that another model, such as a spatial linear regression or non-linear regression, may be better suited. There is also an option in scikit-learn for a Box–Cox transformation that may be interesting to explore in future analysis.

Conclusion¶

In this data analysis, the relationship between urban greenspace and adult high blood pressure was explored across Chicago census tracts. Edge density, vegetation fragmentation, and mean patch size were landscape metrics that were visualized and mapped using NDVI calculations from NAIP imagery. Edge density was normalized by vegetation pixels so that the census tracts could be compared. A higher edge density is associated with a more fragmented greenspace, and a higher mean patch size is associated with more intact greenspace. A log transformation was performed on the data to reduce skew and make it better for using the OLS linear regression model. The Q-Q plots showed a good fit for OLS, with just a small amount of noise. In the scatter plots, there may be some multicollinearity between edge density and vegetation fragmentation, so one variable may need to be removed for future analysis to reduce this violation of assumptions for the OLS model. In the final spatial error map (predicted - actual), we observe clustering of both positive and negative values, indicating that some areas are over- or under-predicted. The negative (under-prediction) values are visually overlapped with census tracts associated with Black-majority, and positive (over-predictions) are associated with other census tracts. This might indicate that the census tracts may not be spatially independent of each other and that other factors may be contributing to HBP, such as race, socioeconomic class, industry, or pollution. While there was some visual association between higher edge density and higher rates of HBP, and higher mean patch size with lower rates of HBP, it was not a perfectly linear relationship. The results with the spatial error map suggest that there is more contributing to the high blood pressure than these landscape metrics alone. Future analysis could include variables like race or socioeconomic demographics, better fit models, or explore other data transformations. Further analysis is needed, but greenspace and public health remain an important area for research and exploration to determine the full implications of the relationship.

Citations¶

  • Chicago Racial Dot Map 2020 (2024) ArcGIS. https://www.arcgis.com/home/item.html?id=0efb750426dc4d3381f5385b437e1816&utm_source=chatgpt.com

  • Kondo, M. C., Fluehr, J. M., McKeon, T., & Branas, C. C. (2018). Urban Green Space and Its Impact on Human Health. International Journal of Environmental Research and Public Health, 15(3), 445. https://doi.org/10.3390/ijerph15030445

  • Liu, D., Kwan, M.-P., & Kan, Z. (2021). Analysis of urban green space accessibility and distribution inequity in the City of Chicago. Urban Forestry & Urban Greening, Volume 59(127029). https://www.sciencedirect.com/science/article/pii/S1618866721000546

  • Mitchell, K. (2025, April 21). The distribution of parks in Chicago. ArcGIS StoryMaps. https://storymaps.arcgis.com/stories/883b4a42aa9e4cada0c66f53b0211ba5

  • Schwegman, D. J. (2025). Historical housing discrimination, redlining, and the contemporary distribution of local economic development funding: The case of Chicago. Economic Development Quarterly, 39(2), 75–92. https://doi.org/10.1177/08912424241309321

  • U.S. Census Bureau. (2024). QuickFacts: Chicago city, Illinois. U.S. Department of Commerce. https://www.census.gov/quickfacts/fact/table/chicagocityillinois/PST045224

  • Zhao, Y., Bao, W.-W., Yang, B.-Y., Liang, J.-H., Gui, Z.-H., Huang, S., Chen, Y.-C., Dong, G.-H., & Chen, Y.-J. (2022). Association between greenspace and blood pressure: A systematic review and meta-analysis. Science of the Total Environment, 817, 152513. https://doi.org/10.1016/j.scitotenv.2021.152513